continuous-time model
Continuous-time Models for Stochastic Optimization Algorithms
We propose new continuous-time formulations for first-order stochastic optimization algorithms such as mini-batch gradient descent and variance-reduced methods. We exploit these continuous-time models, together with simple Lyapunov analysis as well as tools from stochastic calculus, in order to derive convergence bounds for various types of non-convex functions. Guided by such analysis, we show that the same Lyapunov arguments hold in discrete-time, leading to matching rates. In addition, we use these models and Ito calculus to infer novel insights on the dynamics of SGD, proving that a decreasing learning rate acts as time warping or, equivalently, as landscape stretching.
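As a concrete illustration of what such a continuous-time formulation looks like, the sketch below writes down the standard SDE approximation of mini-batch SGD; the specific drift and diffusion terms and the symbols $\eta$, $\Sigma$, and $W_t$ are the usual textbook choices, assumed here rather than taken verbatim from the paper:

$dX_t = -\nabla f(X_t)\, dt + \sqrt{\eta}\, \Sigma(X_t)^{1/2}\, dW_t$

where $\eta$ plays the role of the learning rate, $\Sigma$ is the mini-batch gradient covariance, and $W_t$ is a standard Brownian motion. Under a decreasing learning rate $\eta(t)$, the same equation can be read with a rescaled time variable, which is the "time warping" (equivalently, landscape stretching) view mentioned above.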
Combining Recurrent, Convolutional, and Continuous-time Models with Linear State Space Layers
Recurrent neural networks (RNNs), temporal convolutions, and neural differential equations (NDEs) are popular families of deep learning models for time-series data, each with unique strengths and tradeoffs in modeling power and computational efficiency. We introduce a simple sequence model inspired by control systems that generalizes these approaches while addressing their shortcomings. The Linear State-Space Layer (LSSL) maps a sequence $u \mapsto y$ by simply simulating a linear continuous-time state-space representation $\dot{x} = Ax + Bu, y = Cx + Du$. Theoretically, we show that LSSL models are closely related to the three aforementioned families of models and inherit their strengths. For example, they generalize convolutions to continuous-time, explain common RNN heuristics, and share features of NDEs such as time-scale adaptation. We then incorporate and generalize recent theory on continuous-time memorization to introduce a trainable subset of structured matrices $A$ that endow LSSLs with long-range memory.
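To make the recurrence concrete, here is a minimal sketch of how one might simulate such a layer, assuming a bilinear (Tustin) discretization of $\dot{x} = Ax + Bu, y = Cx + Du$ with step size dt; the matrices below are illustrative placeholders, not the structured trainable $A$ matrices the paper introduces.

```python
import numpy as np

def discretize(A, B, dt):
    # Bilinear (Tustin) discretization of x' = Ax + Bu:
    # x_{k+1} = Abar @ x_k + Bbar @ u_k
    I = np.eye(A.shape[0])
    inv = np.linalg.inv(I - (dt / 2.0) * A)
    return inv @ (I + (dt / 2.0) * A), inv @ (dt * B)

def lssl_apply(A, B, C, D, u, dt=1.0):
    # Map a scalar input sequence u to an output sequence y by unrolling the recurrence.
    Abar, Bbar = discretize(A, B, dt)
    x, ys = np.zeros(A.shape[0]), []
    for u_k in u:
        x = Abar @ x + Bbar.ravel() * u_k
        ys.append(float(C @ x + D * u_k))
    return np.array(ys)

# Toy usage: a 2-state layer applied to a short scalar input sequence.
A = np.array([[-1.0, 0.5], [0.0, -2.0]])
B = np.array([[1.0], [1.0]])
C = np.array([1.0, 0.0])
D = 0.0
y = lssl_apply(A, B, C, D, u=np.sin(np.linspace(0.0, 3.0, 16)), dt=0.1)
```

Because the recurrence is linear and time-invariant, the same input-output map can also be computed as a convolution, which is the connection to temporal convolutions noted in the abstract.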
OT-Transformer: A Continuous-time Transformer Architecture with Optimal Transport Regularization
Kelvin Kan, Xingjian Li, Stanley Osher
Transformers have achieved state-of-the-art performance in numerous tasks. In this paper, we propose a continuous-time formulation of transformers. Specifically, we consider a dynamical system whose governing equation is parametrized by transformer blocks. We leverage optimal transport theory to regularize the training problem, which enhances stability in training and improves generalization of the resulting model. Moreover, we demonstrate in theory that this regularization is necessary as it promotes uniqueness and regularity of solutions. Our model is flexible in that almost any existing transformer architecture can be adopted to construct the dynamical system with only slight modifications to the existing code. We perform extensive numerical experiments on tasks motivated by natural language processing, image classification, and point cloud classification. Our experimental results show that the proposed method improves the performance of its discrete counterpart and outperforms relevant comparison models.
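As a rough sketch of the idea (not the authors' implementation), the snippet below treats a standard transformer encoder block as the velocity field of a dynamical system, integrates it with forward Euler steps, and accumulates an optimal-transport-style kinetic-energy penalty on that velocity; the names ContinuousTransformer, n_steps, and ot_weight are hypothetical.

```python
import torch
import torch.nn as nn

class ContinuousTransformer(nn.Module):
    def __init__(self, dim, n_heads=4, n_steps=8, ot_weight=0.1):
        super().__init__()
        # A standard transformer block parametrizes the governing equation dz/dt = f(z).
        self.block = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        self.n_steps = n_steps
        self.ot_weight = ot_weight

    def forward(self, z):
        dt = 1.0 / self.n_steps
        transport_cost = 0.0
        for _ in range(self.n_steps):      # forward Euler integration of dz/dt = f(z)
            dz = self.block(z) - z          # residual part acts as the velocity field
            transport_cost = transport_cost + (dz ** 2).mean() * dt
            z = z + dt * dz
        return z, self.ot_weight * transport_cost

# Usage: add the returned penalty to the task loss during training.
model = ContinuousTransformer(dim=64)
tokens = torch.randn(2, 10, 64)             # (batch, sequence, features)
out, ot_penalty = model(tokens)
```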
Reviews: Continuous-time Models for Stochastic Optimization Algorithms
I have read the rebuttal and I believe the authors have satisfactorily addressed my comments on prior work, so I have increased my rating. The SDE approximation method is well-established; moreover, the continuous-time approximation of mini-batch SGD has been considered by several prior works, e.g.
Summary and review comments: The paper is well written, and one of its strengths is its generally good comparison with prior work. The main theoretical results are:
- SDE approximations for mini-batch SGD and SVRG
- Well-posedness of the SDEs
- Matching convergence bounds using Lyapunov functions
- Interpreting time-dependent adjustments as time-change and landscape-stretching
Reviews: Continuous-time Models for Stochastic Optimization Algorithms
The paper presents an SDE approximation of mini-batch stochastic gradient descent and stochastic variance-reduced gradient descent, two widely used methods, and derives convergence rates for them. It presents a nice result (i.e., not revolutionary, but still of interest to the community) that fits within this area. Reviewers have a few suggestions for clarifications/improvements.